import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
# for the skewness calculation
from scipy import stats
# plotting modules
import seaborn as sns
import missingno
import matplotlib.pyplot as plt
Objectives
What is the most important factor in determining survival of the Titanic incident?
In the movie, the upper-class passengers were given preference on lifeboats. Does this show in the data?
“Women and children first”. Was this the case?
Add one other observation that you have noted in the dataset.
# Read 'Titanic.csv' into a dataframe
df_titanic = pd.read_csv('Titanic.csv')
# Check each column's dtype and non-null count
df_titanic.info()
# Count the null values in each column
null_count = df_titanic.isnull().sum()
# Keep only the columns that contain null values
null_count = null_count[null_count > 0]
print(f"\nMissing data found in the columns (Other columns have no null values)\n{null_count}")
# Show the first 10 rows of the dataframe
df_titanic.head(10)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Missing data found in the columns (Other columns have no null values)
Age         177
Cabin       687
Embarked      2
dtype: int64
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
I am dropping the 'Name', 'Ticket' and 'Cabin' columns, as no data analysis technique relevant to the listed objectives can be applied to them. These columns also have no direct bearing on the other columns or on the objectives.
# Drop the 'Name', 'Ticket' and 'Cabin' columns
df_titanic.drop(columns=['Name', 'Ticket', 'Cabin'], inplace=True)
# Show the updated dataframe without the dropped columns
df_titanic.head()
| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S |
| 1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S |
| 3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S |
| 4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S |
Understanding the data
What type of variables are found in the columns?
The table below lists every column in the dataframe, the type of variable it holds, and a brief description of what it means.
| Variable Name | Variable Type | Description |
|---|---|---|
| PassengerId | Index | Unique identifier for each passenger in the dataframe |
| Survived | Categorical binary | Whether the passenger survived the event (1 = Yes, 0 = No) |
| Pclass | Ordinal categorical | Passenger class: the level of service the passenger received, ordered 1st > 2nd > 3rd |
| Sex | Categorical | Whether the passenger is 'male' or 'female' |
| Age | Continuous | How old the passenger was, in years, at the time of the event |
| SibSp | Discrete numerical | (SibSp = siblings or spouses). The number of siblings or spouses the passenger had aboard |
| Parch | Discrete numerical | (Parch = parents or children). The number of parents or children the passenger had aboard |
| Fare | Continuous | The amount paid for the ticket(s) |
| Embarked | Categorical | The port where the passenger boarded the Titanic |
How will I display each of the variables?
The table below shows the chart types I can use to explore each variable type visually and to spot any anomalies.
| Variable Type | Graphical representation |
|---|---|
| Categorical | Pie Chart, Bar Chart |
| Continuous, Discrete | Histogram, Box Plot, Heatmap |
# Check the unique values found in each categorical variable (sorted for 'Pclass')
print(f"\nUnique values for categorical variable\n")
print(f"Survived: \t{df_titanic['Survived'].unique()}")
print(f"Sex: \t\t{df_titanic['Sex'].unique()}")
print(f"Pclass: \t{sorted(df_titanic['Pclass'].unique())}")
print(f"Embarked: \t{df_titanic['Embarked'].unique()}")
Unique values for categorical variable

Survived: [0 1]
Sex: ['male' 'female']
Pclass: [1, 2, 3]
Embarked: ['S' 'C' 'Q' nan]
Explaining unique values for each categorical variable in the dataset
The code above finds the unique values used in the dataset for each categorical variable.
Survived
- Value '0' means casualty/death
- Value '1' means survived/alive
Sex
- Value 'male' means the passenger was a male
- Value 'female' means the passenger was a female
Pclass (Passenger class)
- Value '1' means 1st class ticket
- Value '2' means 2nd class ticket
- Value '3' means 3rd class ticket
Embarked
- Value 'S' means Southampton
- Value 'Q' means Queenstown
- Value 'C' means Cherbourg
# Check the values present in the 'Survived' column
print(f"Options in the 'Survived' columns {df_titanic['Survived'].unique()}")
# Count the passengers who didn't survive ('Survived' == 0)
non_survived_count = df_titanic[df_titanic['Survived'] == 0]['Survived'].count()
# Count the passengers who survived ('Survived' == 1)
survived_count = df_titanic[df_titanic['Survived'] == 1]['Survived'].count()
# Count the total number of passengers in the dataframe
total_passengers = df_titanic['PassengerId'].count()
# Count the male passengers in the dataframe
total_male = df_titanic[df_titanic['Sex'] == 'male']['Sex'].count()
# Count the female passengers in the dataframe
total_female = df_titanic[df_titanic['Sex'] == 'female']['Sex'].count()
# Calculate the survival rate (about 0.38) and the non-survival rate (about 0.62)
survival_rate = survived_count / total_passengers
non_survival_rate = non_survived_count / total_passengers
survival_data = {
    'Survived': survived_count,
    'Death': non_survived_count
}
# Pie chart of the survival split; the dict views are converted to lists for Plotly
fig_survived = px.pie(values=list(survival_data.values()), names=list(survival_data.keys()))
# Show the pie chart
fig_survived.show()
# Print the survival and non-survival percentages
print(f"The survival percentage: {survival_rate*100:.2f}%\nThe non survival percentage: {non_survival_rate*100:.2f}%")
print(f"The number of people survived is {survival_rate * total_passengers} and {non_survival_rate * total_passengers} passed away")
Options in the 'Survived' columns [0 1]
The survival percentage: 38.38%
The non survival percentage: 61.62%
The number of people survived is 342.0 and 549.0 passed away
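The counts and rates above can also be obtained in one step each. The following is a minimal sketch on a small made-up dataframe (`toy` is hypothetical and is not the Titanic data), assuming the same 0/1 coding of 'Survived':

```python
import pandas as pd

# Hypothetical toy data standing in for the 'Survived' column (1 = survived, 0 = died)
toy = pd.DataFrame({"Survived": [0, 1, 1, 0, 0, 1, 0, 0]})

# Absolute number of passengers per outcome
counts = toy["Survived"].value_counts()

# Share of passengers per outcome; normalize=True divides each count by the total
rates = toy["Survived"].value_counts(normalize=True)

print(counts.to_dict())
print(rates.to_dict())
```

Applied to `df_titanic['Survived']`, the same two calls should give the survived and deceased totals and their rates without filtering the dataframe twice.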
# Calculate the skewness of the 'Age' distribution shown in the histogram
skewness_age = stats.skew(df_titanic['Age'].dropna())
# Calculate the mean, mode and median of 'Age'
mean_age = df_titanic['Age'].mean()
mode_age = df_titanic['Age'].mode().tolist()
median_age = df_titanic['Age'].median()
# Print the mean, mode, median and skewness
print(f"Mean: {mean_age:.2f}\nMode: {mode_age}\nMedian: {median_age}\nSkewness: {skewness_age:.2f}")
# Plot the histogram for the 'Age' column
fig_age_hist = px.histogram(
    df_titanic,  # dataframe 'df_titanic'
    x="Age"      # column 'Age'
)
# Plot the box plot for the 'Age' column
fig_age_box = px.box(
    df_titanic,
    y='Age'
)
# Adjust the size of the histogram and box plot
fig_age_box.update_layout(height=600, width=800)
fig_age_hist.update_layout(height=600, width=800)
# Show the histogram and box plot for the 'Age' column
fig_age_hist.show()
fig_age_box.show()
Mean: 29.70
Mode: [24.0]
Median: 28.0
Skewness: 0.39
# Plot a matrix of the missing values
missingno.matrix(df_titanic, figsize=(30, 10))
# Count and percentage of missing values in the 'Age' column
age_missing_count = df_titanic['Age'].isnull().sum()
age_missing_rate = age_missing_count / total_passengers
print(f"The 'Age' column contains about {age_missing_rate*100:.2f}% missing data ({age_missing_count} out of {total_passengers})")
# Count and percentage of missing values in the 'Embarked' column
embarked_missing_count = df_titanic['Embarked'].isnull().sum()
embarked_missing_rate = embarked_missing_count / total_passengers
print(f"The 'Embarked' column contains {embarked_missing_rate*100:.2f}% missing data ({embarked_missing_count} out of {total_passengers})")
The 'Age' column contains about 19.87% missing data (177 out of 891)
The 'Embarked' column contains 0.22% missing data (2 out of 891)
After displaying the missing data, you can see that most of the missing values come from the 'Age' column, with a few in 'Embarked'.
Dealing with the missing data that is found in the 'Age' and 'Embarked' columns
What could have caused the missing data?
Age (MNAR): There seems to be a systematic reason for the missingness in the 'Age' column: about 19.87% of the values are missing, and 177 isolated, unrelated omissions are unlikely. However, the matrix above doesn't show any recognisable visual pattern that would indicate what that systematic reason is.
Embarked (MCAR): The missing matrix above shows no systematic reason for these 2 data points (2 out of 891) being missing when compared with the other variables. There are no visual patterns; for example, no entire rows are missing. Since the missing percentage is very low, 2 isolated incidents are a plausible explanation.
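A visual check of the matrix can be complemented with a numeric one. The sketch below, on made-up data (`toy` and the `AgeMissing` column are hypothetical names), computes the share of missing 'Age' values per passenger class. If the share differs strongly between groups, the missingness depends on an observed variable, which would argue against it being completely at random. It is only one possible check and does not by itself settle which mechanism applies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the notebook would use df_titanic before imputation
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Age": [38.0, 54.0, np.nan, 27.0, np.nan, np.nan, 2.0, np.nan],
})

# Flag the rows where Age is missing
toy["AgeMissing"] = toy["Age"].isnull()

# The mean of a boolean flag is the share of missing values in each class
missing_by_class = toy.groupby("Pclass")["AgeMissing"].mean()
print(missing_by_class)
```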
How should I fill in the missing data?
Age column
I will use the median (28) to fill in the missing data. My reasons for choosing the median are below.
- The distribution is slightly skewed to the right, as can be seen in the box plot and histogram.
- Most of the data points are clustered towards the lower end of the scale, with some outliers or high values pulling the distribution towards the right.
- The skewness value is 0.39, which indicates a mild positive skew.
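To make the skewness figure less of a black box, here is a small worked sketch of the moment-based formula that `scipy.stats.skew` uses by default, the third central moment divided by the second central moment raised to the power 1.5. The ages are made up for illustration and are not taken from the dataset.

```python
# Hypothetical ages with a long right tail (not the Titanic data)
ages = [2, 20, 22, 24, 26, 28, 30, 35, 60, 80]

n = len(ages)
mean = sum(ages) / n

# Second and third central moments
m2 = sum((a - mean) ** 2 for a in ages) / n
m3 = sum((a - mean) ** 3 for a in ages) / n

# Moment-based skewness: positive when the right tail is longer
skewness = m3 / m2 ** 1.5

# Median of an even-sized sample: average of the two middle values
sorted_ages = sorted(ages)
median = (sorted_ages[n // 2 - 1] + sorted_ages[n // 2]) / 2

print(f"mean={mean:.2f} median={median} skewness={skewness:.2f}")
```

In this toy sample the few high ages pull the mean above the median, while the median stays with the bulk of the values. That is the usual argument for imputing with the median when a distribution is skewed.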
Embarked column
I will use the mode, 'S' (Southampton), because 'S' accounts for 72.28% of passengers while 'C' (Cherbourg) accounts for 18.86% and 'Q' (Queenstown) for 8.64%, so 'S' is the most likely value. Also, only 2 values (0.22%) are missing, so filling them shouldn't introduce any significant bias into the analysis.
# Replacing the missing value in age with value from 'median_age'
df_titanic.loc[df_titanic['Age'].isnull(), 'Age'] = median_age
# Replacing the missing value in embarked
df_titanic.loc[df_titanic['Embarked'].isnull(), 'Embarked'] = 'S'
Objectives
- What is the most important factor in determining survival of the Titanic incident?
Since the objective asks for the 'most important factor', I will treat it as a single variable rather than a combination of factors that gives the best chance of survival, as a combination wouldn't align with the first objective. A heatmap will show which variable has the most impact on survival.
# Convert the categorical variables to numbers, as the heatmap can't take strings
df_titanic['Sex'] = df_titanic['Sex'].map({
'male':0,
'female':1
})
df_titanic['Embarked'] = df_titanic['Embarked'].map({
'S':0,
'C':1,
'Q':2
})
# Create a heatmap to visualize correlations with 'Survived'
plt.figure(figsize=(10, 8))
sns.heatmap(df_titanic.corr(), annot=True, cmap='rocket', linewidths=0.5)
plt.title('Correlation Heatmap of Titanic Dataset (with Survived)')
plt.show()
To address the first objective, I use the heatmap to see which variable has the strongest correlation with the 'Survived' column. Looking at the 'Survived' column, the variable 'Sex' has the strongest correlation at 0.54. A correlation coefficient of 0.54 suggests a moderate positive correlation; because 'female' is coded as 1, this means that being female is associated with a higher chance of survival.
Another reason for using the heatmap is that it avoids plotting a separate graph for each variable against 'Survived'.
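The same reading can be taken from the correlation matrix directly instead of by eye. The sketch below uses a made-up dataframe (`toy` is hypothetical) and ranks the variables by the absolute size of their correlation with 'Survived', since a strong negative correlation matters as much as a strong positive one.

```python
import pandas as pd

# Hypothetical toy data with the same numeric coding as the heatmap cell
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1, 0, 1],
    "Sex":      [0, 1, 1, 0, 0, 1, 1, 0],   # 0 = male, 1 = female
    "Pclass":   [3, 1, 2, 3, 3, 1, 3, 2],
    "Fare":     [7.25, 71.28, 13.0, 8.05, 7.75, 53.1, 7.9, 26.0],
})

# Correlation of every variable with 'Survived', excluding 'Survived' itself
corr_with_survived = toy.corr()["Survived"].drop("Survived")

# Order by absolute strength while keeping the sign of each coefficient
order = corr_with_survived.abs().sort_values(ascending=False).index
ranked = corr_with_survived.reindex(order)
print(ranked.round(2))
```

One caveat, stated as an assumption rather than a finding: the 0/1/2 codes given to 'Embarked' are an arbitrary ordering of a nominal variable, so its correlation coefficient is harder to interpret than the one for a binary variable such as 'Sex'.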
# Convert the 'Sex' codes back to strings (e.g. 0 to 'male')
df_titanic['Sex'] = df_titanic['Sex'].map({
    0: 'male',
    1: 'female'
})
# Convert the 'Embarked' codes back to strings (e.g. 0 to 'S')
df_titanic['Embarked'] = df_titanic['Embarked'].map({
    0: 'S',
    1: 'C',
    2: 'Q'
})
# Create a new dataframe of the passengers who survived ('Survived' == 1)
df_survived = df_titanic[df_titanic['Survived'] == 1]
# Show the last rows of 'df_survived'
df_survived.tail()
| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 875 | 876 | 1 | 3 | female | 15.0 | 0 | 0 | 7.2250 | C |
| 879 | 880 | 1 | 1 | female | 56.0 | 0 | 1 | 83.1583 | C |
| 880 | 881 | 1 | 2 | female | 25.0 | 0 | 1 | 26.0000 | S |
| 887 | 888 | 1 | 1 | female | 19.0 | 0 | 0 | 30.0000 | S |
| 889 | 890 | 1 | 1 | male | 26.0 | 0 | 0 | 30.0000 | C |
# Create a new dataframe of the passengers who died ('Survived' == 0)
df_deceased = df_titanic[df_titanic['Survived'] == 0]
# Show the last rows of 'df_deceased'
df_deceased.tail()
| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 884 | 885 | 0 | 3 | male | 25.0 | 0 | 0 | 7.050 | S |
| 885 | 886 | 0 | 3 | female | 39.0 | 0 | 5 | 29.125 | Q |
| 886 | 887 | 0 | 2 | male | 27.0 | 0 | 0 | 13.000 | S |
| 888 | 889 | 0 | 3 | female | 28.0 | 1 | 2 | 23.450 | S |
| 890 | 891 | 0 | 3 | male | 32.0 | 0 | 0 | 7.750 | Q |
# Count the males who survived, using 'df_survived'
amount_survived_m = df_survived[df_survived['Sex'] == 'male']['Sex'].count()
# Count the females who survived, using 'df_survived'
amount_survived_f = df_survived[df_survived['Sex'] == 'female']['Sex'].count()
# Count the males who died, using 'df_deceased'
amount_deceased_m = df_deceased[df_deceased['Sex'] == 'male']['Sex'].count()
# Count the females who died, using 'df_deceased'
amount_deceased_f = df_deceased[df_deceased['Sex'] == 'female']['Sex'].count()
# Print the number of males and females who survived or died
print(f"Male survived: {amount_survived_m}\nMale deceased: {amount_deceased_m}")
print(f"Female survived: {amount_survived_f}\nFemale deceased: {amount_deceased_f}")
# Calculate the survival rate for males and females
survival_rate_m = amount_survived_m / total_male
survival_rate_f = amount_survived_f / total_female
# Calculate the death rate for males and females
deceased_rate_m = amount_deceased_m / total_male
deceased_rate_f = amount_deceased_f / total_female
# Print the survival and death rates for males and females
print(f"\nMale survived: {survival_rate_m:.2f}\nMale deceased: {deceased_rate_m:.2f}")
print(f"Female survived: {survival_rate_f:.2f}\nFemale deceased: {deceased_rate_f:.2f}")
Male survived: 109
Male deceased: 468
Female survived: 233
Female deceased: 81

Male survived: 0.19
Male deceased: 0.81
Female survived: 0.74
Female deceased: 0.26
fig_survival_deceased_bar = go.Figure(data= [
go.Bar(name="Male", x=["Survived","Deceased"], y=[amount_survived_m,amount_deceased_m]),
go.Bar(name="Female", x=["Survived","Deceased"], y=[amount_survived_f,amount_deceased_f])
])
fig_survival_deceased_bar.update_layout(barmode='group', title="Survival and Deceased Frequency by Gender")
fig_survival_deceased_bar.show()
The bar graph shows the difference in survival between males and females. You were more likely to survive if you were female. In a group of 100 people of each gender, the chance of survival is as follows:
- 19 in 100 for males (approximately 19% chance)
- 74 in 100 for females (approximately 74% chance)
In conclusion, I would say that 'Sex' is the most important factor, as it has the strongest correlation with survival compared to the other variables.
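The per-gender rates can also be computed without building separate survived and deceased dataframes. This is a minimal sketch on made-up data (`toy` is hypothetical); it relies on 'Survived' being coded 0/1, so the mean of the column within a group equals that group's survival rate.

```python
import pandas as pd

# Hypothetical toy data (not the Titanic data)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female", "male", "male"],
    "Survived": [0, 1, 1, 0, 1, 0, 0, 0],
})

# Mean of a 0/1 column per group = share of survivors in that group
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
print(rate_by_sex)
```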
Objectives
In the movie, the upper-class passengers were given preference on lifeboats. Does this show in the data?
To address this objective, I first need to decide how to tell whether a passenger got a place on a lifeboat, as the data doesn't directly record this. I will have to assume that getting a place on a lifeboat means the passenger survived. So I will group by Pclass, calculate the survival rate of each class, and compare whether there is a significant difference.
Below is the information that focuses on the differences between the classes:
- Total survived (1st,2nd,3rd)
- Total amount (1st,2nd,3rd)
- Survival rate (1st,2nd,3rd)
# Count the first class passengers who survived, using 'df_survived'
survived_first_class = df_survived[df_survived['Pclass'] == 1]['Pclass'].count()
# Count all first class passengers (survived or deceased), using 'df_titanic'
total_first_class = df_titanic[df_titanic['Pclass'] == 1]['Pclass'].count()
# Calculate the survival rate for first class passengers
rate_survived_1st = survived_first_class / total_first_class
# Print the survivors, total passengers and survival rate for first class
print(f"Survived(1st):{survived_first_class} Total(1st):{total_first_class} Rate(1st):{rate_survived_1st:.2f}")
# Count the second class passengers who survived, using 'df_survived'
survived_second_class = df_survived[df_survived['Pclass'] == 2]['Pclass'].count()
# Count all second class passengers (survived or deceased), using 'df_titanic'
total_second_class = df_titanic[df_titanic['Pclass'] == 2]['Pclass'].count()
# Calculate the survival rate for second class passengers
rate_survived_2nd = survived_second_class / total_second_class
# Print the survivors, total passengers and survival rate for second class
print(f"Survived(2nd):{survived_second_class} Total(2nd):{total_second_class} Rate(2nd):{rate_survived_2nd:.2f}")
# Count the third class passengers who survived, using 'df_survived'
survived_third_class = df_survived[df_survived['Pclass'] == 3]['Pclass'].count()
# Count all third class passengers (survived or deceased), using 'df_titanic'
total_third_class = df_titanic[df_titanic['Pclass'] == 3]['Pclass'].count()
# Calculate the survival rate for third class passengers
rate_survived_3rd = survived_third_class / total_third_class
# Print the survivors, total passengers and survival rate for third class
print(f"Survived(3rd):{survived_third_class} Total(3rd):{total_third_class} Rate(3rd):{rate_survived_3rd:.2f}")
Survived(1st):136 Total(1st):216 Rate(1st):0.63
Survived(2nd):87 Total(2nd):184 Rate(2nd):0.47
Survived(3rd):119 Total(3rd):491 Rate(3rd):0.24
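The three near-identical blocks above can be collapsed into a single grouped aggregation. The sketch below runs on made-up data (`toy` and the output column names `survived`, `total` and `rate` are hypothetical choices), again assuming 'Survived' is coded 0/1.

```python
import pandas as pd

# Hypothetical toy data (not the Titanic data)
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0, 0],
})

# One row per class: survivors (sum of 1s), passengers (count), survival rate (mean)
summary = toy.groupby("Pclass")["Survived"].agg(
    survived="sum", total="count", rate="mean"
)
print(summary)
```

A table like this keeps the three figures for each class side by side, which makes it harder to mix up a count with a rate when comparing classes.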
name_class = ["First","Second","Third"]
rate_class = [rate_survived_1st,rate_survived_2nd,rate_survived_3rd]
bar_colour_class = ['green','orange','red']
fig_class_rate = go.Figure(data= [
go.Bar(name="Survival rates", x=name_class, y=rate_class, marker=dict(color=bar_colour_class) )
])
fig_class_rate.update_layout(barmode='group', title="The Survival Rates by 'Passenger Class'")
fig_class_rate.show()
Looking directly at the survival rate of passengers by class, first class has the greatest survival rate, second class comes in the middle, and third class comes last. However, if we look at the problem without any nuance, we can come to the wrong conclusion. Several assumptions have been made:
- Each class has a number of lifeboat places directly proportional to its size. For example, if each class were required to have at least 20% of the lifeboats reserved for it, then the statement above would be true. This matters because the data shows that there were more passengers in 3rd class (491 total).
- The lifeboats are assumed to be evenly distributed around the ship, meaning that every class had equal access.
- Human error may have occurred due to the unexpected conditions, which would have led certain people to act irrationally. Maybe there was a plan to evacuate, but the Titanic was famously referred to as the "unsinkable" ship, which suggests there wasn't much planning.
There are probably more assumptions to be made, but if we create a bar graph and look more closely at the data, it reveals a different story.
#Using 'df_titanic' dataframe to find the total of people Survived with first class (MALE ONLY)
m_sur_1st = df_titanic[
(df_titanic['Pclass'] == 1) &
(df_titanic['Sex'] == "male") &
(df_titanic['Survived'] == 1)
]['Pclass'].count()
#Using 'df_titanic' dataframe to find the total of people Survived with first class (FEMALE ONLY)
f_sur_1st = df_titanic[
(df_titanic['Pclass'] == 1) &
(df_titanic['Sex'] == "female") &
(df_titanic['Survived'] == 1)
]['Pclass'].count()
#Using 'df_titanic' dataframe to find the total of people Survived with second class (MALE ONLY)
m_sur_2nd = df_titanic[
(df_titanic['Pclass'] == 2) &
(df_titanic['Sex'] == "male") &
(df_titanic['Survived'] == 1)
]['Pclass'].count()
#Using 'df_titanic' dataframe to find the total of people Survived with second class (FEMALE ONLY)
f_sur_2nd = df_titanic[
(df_titanic['Pclass'] == 2) &
(df_titanic['Sex'] == "female") &
(df_titanic['Survived'] == 1)
]['Pclass'].count()
#Using 'df_titanic' dataframe to find the total of people Survived with third class (MALE ONLY)
m_sur_3rd = df_titanic[
(df_titanic['Pclass'] == 3) &
(df_titanic['Sex'] == "male") &
(df_titanic['Survived'] == 1)
]['Pclass'].count()
#Using 'df_titanic' dataframe to find the total of people Survived with third class (FEMALE ONLY)
f_sur_3rd = df_titanic[
(df_titanic['Pclass'] == 3) &
(df_titanic['Sex'] == "female") &
(df_titanic['Survived'] == 1)
]['Pclass'].count()
name_class = ["First","Second","Third"]
bar_colour_class = ['green','orange','red']
# Create a grouped bar figure with one bar per class and sex (six bars in total)
fig_class_rate = go.Figure(data=[
go.Bar(name="Female (1st)", x=name_class, y=[f_sur_1st, 0, 0], marker=dict(color=bar_colour_class[0])),
go.Bar(name="Male (1st)", x=name_class, y=[m_sur_1st, 0, 0], marker=dict(color=bar_colour_class[0])),
go.Bar(name="Female (2nd)", x=name_class, y=[0, f_sur_2nd, 0], marker=dict(color=bar_colour_class[1])),
go.Bar(name="Male (2nd)", x=name_class, y=[0, m_sur_2nd, 0], marker=dict(color=bar_colour_class[1])),
go.Bar(name="Female (3rd)", x=name_class, y=[0, 0, f_sur_3rd], marker=dict(color=bar_colour_class[2])),
go.Bar(name="Male (3rd)", x=name_class, y=[0, 0, m_sur_3rd], marker=dict(color=bar_colour_class[2]))
])
fig_class_rate.update_layout(barmode='group', title="The Survival Total by 'Passenger Class' and 'Sex'")
fig_class_rate.show()
If the statement "upper-class passengers were given preference on lifeboats" were true, then more males from second class should have survived than from third class, since second class is higher than third in terms of status. However, this is not the case, as the observations below contradict the narrative:
- The number of female survivors in third class was slightly greater than in second class, but overall similar, showing no preference
- The number of male survivors in third class was approximately 2.7 times greater than in second class, suggesting a preference for third class males over second class males
Overall, the bar graph "The Survival Total by 'Passenger Class' and 'Sex'" shows that the distribution of passengers in the lifeboats is even across the classes. If the statement were true, I would expect to see a descending order from first to third class with significant differences in values. In conclusion, I reject the statement, as there isn't enough evidence to suggest that preference was given to a certain class of people, based on the assumptions made and the data presented.
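Because the classes differ in size, it can help to put survivor counts next to within-group survival rates before drawing a conclusion from either. The sketch below does this with `pivot_table` on made-up data (`toy` is hypothetical, and its numbers are chosen only to show that the two views can disagree). It makes no claim about what the real dataset shows.

```python
import pandas as pd

# Hypothetical toy data: a small second class and a larger third class
toy = pd.DataFrame({
    "Pclass":   [2] * 3 + [3] * 10,
    "Sex":      ["male", "male", "female"] + ["male"] * 8 + ["female"] * 2,
    "Survived": [1, 0, 1] + [1, 1, 0, 0, 0, 0, 0, 0] + [1, 0],
})

# Number of survivors per class and sex (sum of the 0/1 column)
survivors = toy.pivot_table(index="Pclass", columns="Sex", values="Survived", aggfunc="sum")

# Survival rate per class and sex (mean of the 0/1 column)
rates = toy.pivot_table(index="Pclass", columns="Sex", values="Survived", aggfunc="mean")

print(survivors)
print(rates)
```

In this toy example third class males have more survivors than second class males, yet a lower survival rate, simply because the third class group is larger. Running the same two pivots on `df_titanic` would show whether the count-based comparison above and a rate-based comparison point the same way.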
Objective
“Women and children first”. Was this the case?
The statement means that women and children had a higher priority than men in an emergency. I will make the following assumptions to answer it:
- Children are passengers under the age of 18
- Women are female passengers aged 18 to 99
- Men are male passengers aged 18 to 99
If the statement is true, I expect the survival rate for men to be significantly lower than for women and children.
After calculating the rates, the statement holds: men did sacrifice themselves, as the figures below show
- Men (16.57%) < Children (53.98%)
- Men (16.57%) < Women (75.29%)
You can also see this in the bar chart, as the 'Deceased' bar for men is greater than those for children and women. In conclusion, the case for "Women and children first" was true.
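The three groups defined above can be assigned as a label per passenger, after which the rates follow from a single group-by. This is a minimal sketch on made-up data; `toy`, `label_group` and the `Group` column are hypothetical names, and the age threshold of 18 is the assumption stated above.

```python
import pandas as pd

# Hypothetical toy data (not the Titanic data)
toy = pd.DataFrame({
    "Age": [4, 17, 18, 30, 45, 16, 60, 25],
    "Sex": ["male", "female", "female", "male", "male", "male", "female", "male"],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 1],
})

def label_group(row):
    """Children are under 18 regardless of sex; adults are split by sex."""
    if row["Age"] < 18:
        return "Children"
    return "Women" if row["Sex"] == "female" else "Men"

# Assign each passenger to exactly one group
toy["Group"] = toy.apply(label_group, axis=1)

# Survival rate per group (mean of the 0/1 column)
rate_by_group = toy.groupby("Group")["Survived"].mean()
print(rate_by_group)
```

Labelling first guarantees the groups do not overlap and that every passenger with a known age is counted once.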
#Counting the amount of people who are under 18 and survived
children_survived = df_titanic[
(df_titanic['Age'] < 18) &
(df_titanic['Survived'] == 1)
]['PassengerId'].count()
#Counting the amount of people who are under 18 and died
children_deceased = df_titanic[
(df_titanic['Age'] < 18) &
(df_titanic['Survived'] == 0)
]['PassengerId'].count()
# Count the females aged 18 or older who survived
women_survived = df_titanic[
(df_titanic['Age'] >= 18) &
(df_titanic['Survived'] == 1) &
(df_titanic['Sex'] == 'female')
]['PassengerId'].count()
# Count the females aged 18 or older who died
women_deceased = df_titanic[
(df_titanic['Age'] >= 18) &
(df_titanic['Survived'] == 0) &
(df_titanic['Sex'] == 'female')
]['PassengerId'].count()
# Count the males aged 18 or older who survived
men_survived = df_titanic[
(df_titanic['Age'] >= 18) &
(df_titanic['Survived'] == 1) &
(df_titanic['Sex'] == 'male')
]['PassengerId'].count()
# Count the males aged 18 or older who died
men_deceased = df_titanic[
(df_titanic['Age'] >= 18) &
(df_titanic['Survived'] == 0) &
(df_titanic['Sex'] == 'male')
]['PassengerId'].count()
#Storing the count for the total of survived 'children', 'women' and 'men'
survived_data = [children_survived, women_survived, men_survived]
#Storing the count for the total of deceased 'children', 'women' and 'men'
deceased_data = [children_deceased, women_deceased, men_deceased]
#The x-axis on the bar chart which names are 'Children','Women' and 'Men'
age_group_name = ['Children', 'Women', 'Men']
#Colour selector for the bars in the bar chart (e.g bar_colour_class[0] = green)
bar_colour_class = ['green', 'red']
# Create the bar chart that displays 'survived_data' and 'deceased_data' for each group in 'age_group_name'
fig_priority = go.Figure(data=[
go.Bar(name="Survived", x=age_group_name, y=survived_data, marker=dict(color=bar_colour_class[0])),
go.Bar(name="Deceased", x=age_group_name, y=deceased_data, marker=dict(color=bar_colour_class[1])),
])
#Setting the barmode='group' and changing the title of the graph
fig_priority.update_layout(barmode='group', title="The Survival/Deceased Total Based on Age and Gender")
#Showing the "The Survival/Deceased Total Based on Age and Gender"
fig_priority.show()
#Calculating the survival rate for 'Children', 'Women' and 'Men'
survival_rate_children = children_survived / (children_survived+children_deceased)
survival_rate_women = women_survived / (women_survived+women_deceased)
survival_rate_men = men_survived / (men_survived+men_deceased)
print(f"The Survival Rate\nChildren: {survival_rate_children:.2f} ({survival_rate_children*100:.2f}%)")
print(f"Women: {survival_rate_women:.2f} ({survival_rate_women*100:.2f}%)")
print(f"Men: {survival_rate_men:.2f} ({survival_rate_men*100:.2f}%)")
The Survival Rate
Children: 0.54 (53.98%)
Women: 0.75 (75.29%)
Men: 0.17 (16.57%)
min_age = df_survived['Age'].min()
max_age = df_survived['Age'].max()
print(min_age,max_age)
0.42 80.0
# Keep survivors aged 0-80; copy so that adding a column doesn't write to a slice of df_survived
df_filtered = df_survived[df_survived['Age'].between(0, 80)].copy()
# Create age groups with a width of 10; the upper edge is 90 so that age 80 falls into the [80, 90) bin
df_filtered['Age Group'] = pd.cut(df_filtered['Age'], bins=range(0, 91, 10), right=False)
# Plot using Seaborn; df_filtered only contains survivors, so a single colour is needed
plt.figure(figsize=(10, 6))
sns.countplot(x='Age Group', hue='Survived', data=df_filtered, palette=['green'])
plt.title('Survival Comparison of Younger and Older Passengers on Titanic')
plt.xlabel('Age Group')
plt.ylabel('Number of Passengers')
plt.legend(title='Survival Status', labels=['Survived'])
plt.show()
Judging by this graph alone, more of the survivors were younger passengers than much older ones.
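A count of survivors per age band depends on how many passengers were in each band to begin with, so a rate per band is a useful companion figure. The sketch below is run on made-up data (`toy`, `AgeGroup` and the output column names are hypothetical) and bins all passengers, not only survivors, before computing the survival rate of each band.

```python
import pandas as pd

# Hypothetical toy data covering the same 0.42 to 80 age range
toy = pd.DataFrame({
    "Age": [0.42, 5, 9, 15, 22, 28, 35, 47, 62, 80],
    "Survived": [1, 1, 0, 1, 0, 1, 0, 0, 0, 1],
})

# Left-closed bins of width 10; the last edge is 90 so that age 80 is included
bins = list(range(0, 91, 10))
toy["AgeGroup"] = pd.cut(toy["Age"], bins=bins, right=False)

# Passengers and survival rate per band; observed=True skips empty bands
rate_by_band = toy.groupby("AgeGroup", observed=True)["Survived"].agg(
    total="count", rate="mean"
)
print(rate_by_band)
```

With very few passengers in a band, as in the oldest bands here, the rate is unstable, so the `total` column should be read alongside it.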
Reference
QUCLA (2021). What is the difference between categorical, ordinal and interval variables? [online] stats.oarc.ucla.edu. Available at: https://stats.oarc.ucla.edu/other/mult-pkg/whatstat/what-is-the-difference-between-categorical-ordinal-and-interval-variables/.
kaggle.com. (n.d.). Titanic - Machine Learning from Disaster. [online] Available at: https://www.kaggle.com/competitions/titanic/data.